Red Wine Quality Data Analysis Project by TIEN LE

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This dataset consists of 13 variables, with 1,599 observations.

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

This plot shows that the majority of wines in this dataset are scored at 5 or 6. The minimum quality score is 3 and maximum is 8 in this dataset.
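
As a cross-check, the summary statistics above can be recovered from the frequency table alone. The following is a minimal sketch that rebuilds the quality scores from the printed counts; the variable names are illustrative and are not taken from the original analysis code.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Expand the table back into one score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines <- length(quality)          # number of observations
mean_quality <- mean(quality)       # should agree with the printed mean
median_quality <- median(quality)   # should agree with the printed median

# Share of wines scored 5 or 6.
share_5_or_6 <- mean(quality %in% c(5, 6))

print(round(c(n = n_wines, mean = mean_quality, median = median_quality,
              share_5_or_6 = share_5_or_6), 3))
```

By this count, roughly 82% of the wines are scored 5 or 6.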

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

This plot shows that many of the wines sampled have a fixed acidity between 7 and 9 g/dm^3. When faceted by quality, most of the data falls in quality scores 5 and 6, with some in 7. The faceted plots share the same bell-curve shape as the plot for the entire sample, so it’s hard to tell whether there is any relationship here.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Similarly, this plot shows that many of the wines sampled have a volatile acidity between 0.4 and 0.8 g/dm^3. When faceted by quality, most of the data again falls in quality scores 5 and 6, with some in 7. The faceted plots share the same shape as the plot for the entire sample, so it’s hard to tell whether there is any relationship here. It’s worth noting that removing the outliers gives a clearer picture.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

This plot shows that most wines have a citric acid level of less than 0.75 g/dm^3.

## Warning: Removed 84 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

After trimming the outliers from the first (long-tailed) plot to produce the second, we can see that the majority of wines have between 1.5 and 2.5 g/dm^3 of residual sugar. While 75% of wines have at most 2.6 g/dm^3, the most extreme case reaches 15.5 g/dm^3. Most red wines aren’t sweet! (Wines with more than 45 g/dm^3 are considered sweet.)
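
The report does not state which cutoff was used to trim the long tail. As one possible choice, the sketch below applies Tukey's rule (Q3 + 1.5 * IQR) to the quartiles printed in the summary; this rule is an assumption made for illustration, not the author's method, and the ten sample values are simply the first rows shown in the str() output.

```r
# Quartiles and maximum of residual sugar (g/dm^3), copied from the summary above.
q1 <- 1.9
q3 <- 2.6
max_value <- 15.5

# Tukey's rule: values beyond Q3 + 1.5 * IQR are flagged as high outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# First ten residual sugar values from the str() output, for illustration only.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
trimmed <- residual_sugar[residual_sugar <= upper_fence]

print(upper_fence)
print(trimmed)
```

Under this rule the fence sits at 3.65 g/dm^3, far below the maximum of 15.5 g/dm^3, which is consistent with the long right tail seen in the untrimmed plot.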

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

This plot shows that almost all red wines have a density of less than 1 g/cm^3.

## Warning: Removed 74 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

After trimming the original plot of wines by chlorides to eliminate the long tail, the distribution looks approximately normal. Most red wines (75%) have at most 0.09 g/dm^3 of chlorides.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

This distribution looks approximately normal as well. Most wines (75%) have a pH of 3.4 or less. The minimum pH is 2.74 and the maximum is 4.01 for the wines in our data set.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

This distribution is right skewed (the mean of 15.87 is above the median of 14), with free sulfur dioxide as low as 1 mg/dm^3 and as high as 72 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

This distribution is also right skewed (the mean of 46.47 is above the median of 38), with total sulfur dioxide as low as 6 mg/dm^3 and as high as 289 mg/dm^3.
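
A quick way to judge the direction of skew from the printed summaries is to compare the mean with the median. The sketch below does this for the two sulfur dioxide variables; it is a rule of thumb rather than a formal skewness test, and the data frame name is illustrative.

```r
# Mean and median copied from the two summaries above.
sulfur_summary <- data.frame(
  variable = c("free.sulfur.dioxide", "total.sulfur.dioxide"),
  mean     = c(15.87, 46.47),
  median   = c(14.00, 38.00)
)

# A mean above the median suggests a longer right tail (right skew);
# a mean below the median suggests a longer left tail (left skew).
sulfur_summary$tail <- ifelse(sulfur_summary$mean > sulfur_summary$median,
                              "right", "left")

print(sulfur_summary)
```

For both variables the mean exceeds the median, and the maximum lies much further from the median than the minimum does, which points to a long right tail.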

## Warning: Removed 58 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 58 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The majority of red wines in this data set have less than 1 g/dm^3 sulphates.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 12 properties (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol percentage and quality). The variable quality is an ordinal variable with the following scale:

quality: 0 (very bad) -> 10 (very excellent)
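
The str() output at the top of this report lists quality as an integer column. If an ordered factor is wanted, one way to create it is sketched below, using the first ten quality scores from that output; the 0 to 10 level range follows the scale described above, and the variable names are illustrative.

```r
# First ten quality scores from the str() output, stored as integers.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Convert to an ordered factor on the full 0-10 scale so that
# comparisons such as "<" respect the ranking of the scores.
quality_ordered <- factor(quality, levels = 0:10, ordered = TRUE)

print(is.ordered(quality_ordered))
print(table(quality_ordered))
print(quality_ordered[1] < quality_ordered[4])  # compares a score of 5 with a score of 6
```

Declaring all eleven levels keeps unobserved scores (such as 0 or 10) visible as empty categories in tables and plots.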

Other observations:

Most wines have a quality score of 5, followed by 6. The median fixed acidity of red wines in this data set is 7.9 g/dm^3. A small number of wines have very high fixed acidity. Most red wines here have a volatile acidity between 0.2 and 0.8 g/dm^3. All red wines have a citric acid level of less than or equal to 1 g/dm^3. 75% of wines have a citric acid level of 0.42 g/dm^3 or less. The majority of wines have a residual sugar level between 1.5 and 2.5 g/dm^3.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in my dataset is Quality. I’m evaluating which other variables contribute to or correlate with Quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think volatile acidity, residual sugar, citric acid, pH, total sulfur dioxide and alcohol will affect the quality of wines.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

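Yes. Residual sugar, chlorides and sulphates have long right tails, so I trimmed the extreme values from their histograms to make the bulk of each distribution easier to see.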

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

Based on these correlation coefficients (Pearson’s r), the two variables most strongly correlated with quality are alcohol (r = 0.476) and volatile acidity (r = -0.391). Alcohol is positively correlated with quality (the higher the alcohol, the higher the quality) and volatile acidity is negatively correlated with quality (the higher the volatile acidity, the lower the quality). Both correlations are moderate rather than strong.
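
To make the ranking explicit, the sketch below sorts the correlations with quality by absolute value. The numbers are copied from the quality row of the matrix above and rounded to three decimals; the row index X is left out, and the variable names are illustrative.

```r
# Pearson correlations with quality, copied from the "quality" row above
# and rounded to three decimals.
r_quality <- c(
  fixed.acidity = 0.124, volatile.acidity = -0.391, citric.acid = 0.226,
  residual.sugar = 0.014, chlorides = -0.129, free.sulfur.dioxide = -0.051,
  total.sulfur.dioxide = -0.185, density = -0.175, pH = -0.058,
  sulphates = 0.251, alcohol = 0.476
)

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

print(ranked)
```

Alcohol and volatile acidity come out on top, followed by sulphates and citric acid.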

By plotting volatile.acidity against quality and coloring the data points by quality, we can see a trend: higher quality wines tend to have lower volatile acidity. However, a large group of wines with quality scores of 5 and 6 share the same volatile acidity levels, and some wines scored 8 have higher volatile acidity than some wines scored 5 or 6.

When using boxplots, we can see that the median volatile acidity decreases as the quality of wine increases.
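
The centre line of each box in a boxplot marks the group median. The sketch below computes that value per quality score with base R, using only the first ten rows shown in the str() output, so it illustrates the mechanics rather than reproducing the full-sample trend; the object names are illustrative.

```r
# First ten rows of two columns, copied from the str() output.
wine_sample <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Median volatile acidity per quality score: the value a boxplot
# draws as the centre line of each box.
medians <- tapply(wine_sample$volatile.acidity, wine_sample$quality, median)

print(medians)
```

With only ten rows the per-group medians are noisy; the same call on the full data frame would give the values behind the boxplot.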

This scatterplot of quality vs alcohol shows that higher quality wines tend to have higher alcohol levels. We do see a few exceptions: some high quality wines have lower alcohol levels than lower quality wines.

When using boxplots, we can see that the median alcohol level decreases from a quality score of 4 to 5, but from a score of 5 upward, the higher the quality, the higher the alcohol.

## Warning: Removed 126 rows containing non-finite values (stat_smooth).
## Warning: Removed 126 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).

In this scatterplot of volatile acidity against citric acid, we see a moderate negative relationship (r = -0.55): the higher the citric acid, the lower the volatile acidity.

In this box plot, total sulfur dioxide shows a pattern that mirrors the one for alcohol: from a quality score of 5 upward, the higher the quality of the wine, the lower the total sulfur dioxide.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

Among wines with a quality score of 5 or above, higher quality wines tend to have higher alcohol levels.

Higher quality wines tend to have lower volatile acidity. Among wines with a quality score of 5 or above, higher quality wines also tend to have lower total sulfur dioxide.

Did you observe any interesting relationships between the other features

Yes - the higher the citric acid, the higher the fixed acidity.

What was the strongest relationship you found?

By magnitude, the strongest relationship is between fixed acidity and pH (Pearson’s r = -0.68), closely followed by fixed acidity and citric acid (r = 0.67).
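
The strongest pair can be located programmatically by looking for the largest absolute off-diagonal entry. The sketch below does this on a four-variable subset of the correlation matrix above, with values rounded to three decimals; the subset was chosen for brevity and the object names are illustrative.

```r
# A subset of the correlation matrix above (values rounded to three decimals).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
r <- matrix(c( 1.000,  0.672,  0.668, -0.683,
               0.672,  1.000,  0.365, -0.542,
               0.668,  0.365,  1.000, -0.342,
              -0.683, -0.542, -0.342,  1.000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Blank out the diagonal and the lower triangle so each pair is counted once.
r[!upper.tri(r)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(r) == max(abs(r), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(r)[idx[1, "row"]], colnames(r)[idx[1, "col"]])

print(strongest)
print(r[idx])
```

Comparing absolute values matters here: the negative correlation between fixed acidity and pH is slightly larger in magnitude than the positive one between fixed acidity and citric acid.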

Multivariate Plots Section

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

This is a matrix of all graphs and charts for all variables.

## # A tibble: 6 x 5
##   quality density median_alcohol median_density     n
##     <int>   <dbl>          <dbl>          <dbl> <int>
## 1       3 0.99471           9.80        0.99471     1
## 2       3 0.99476          10.90        0.99476     1
## 3       3 0.99600           9.95        0.99600     1
## 4       3 0.99660          10.70        0.99660     1
## 5       3 0.99705           9.70        0.99705     1
## 6       3 0.99808          10.20        0.99808     1

This graph plots alcohol against density, colored by quality.

These graphs show volatile acidity against citric acid by quality. We can see a possible negative correlation between volatile acidity and citric acid, but there’s no clear indication of a correlation between quality and citric acid. We see fewer wines with high volatile acidity as the quality goes up, indicating a correlation there, although the distributions for quality scores of 5 and 6 look pretty similar.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There seems to be a negative correlation between alcohol level and density (higher alcohol level => lower density) and a similar relationship between volatile acidity and citric acid (higher volatile acidity => lower citric acid).

Were there any interesting or surprising interactions between features?

It is interesting that only two variables stand out as being correlated with Quality: alcohol (r = 0.48) and volatile acidity (r = -0.39), and both correlations are moderate rather than strong. However, other variables are correlated with these two and may influence Quality indirectly.


Final Plots and Summary

Plot One

Description One

The majority of the red wines in our data set have a quality score of 5 and 6.

Plot Two

Description Two

According to this boxplot, volatile acidity decreases as quality goes up, indicating a negative correlation.

Plot Three

Description Three

This plot shows a potential positive correlation between alcohol and quality: the higher the quality, the further right the dots tend to sit, indicating higher alcohol.


Reflection

Through this exploratory data analysis, I identified a couple of key variables that are associated with wine quality: alcohol level and volatile acidity. Several other variables show long-tailed distributions. In my opinion, wine quality is a subjective measure: there’s no standard calculation of wine quality, just individual ratings. The strength of the correlations should therefore be taken as is. Further inferential study could investigate these relationships in more depth.